K8SPG-680: add ReadyForBackup condition to the pg-cluster #1133

Merged: 8 commits merged into main from K8SPG-680 on Apr 18, 2025
Conversation

pooknull (Contributor) commented Apr 15, 2025

K8SPG-680

https://perconadev.atlassian.net/browse/K8SPG-680

DESCRIPTION

Problem:
After a failed PVC resize on cluster1-repo1, scheduled backups cannot be created successfully. Although the pg-backup object is created, it gets stuck in the Starting state.

Cause:
When a PVC resize fails, Crunchy's PostgresCluster resource gets an Unknown status for the PGBackRestReplicaRepoReady condition. This condition is required for creating a backup job in the reconcileManualBackup method:

// determine if the dedicated repository host is ready using the repo host ready
// condition, and return if not
repoCondition := meta.FindStatusCondition(postgresCluster.Status.Conditions, ConditionRepoHostReady)
if repoCondition == nil || repoCondition.Status != metav1.ConditionTrue {
	return nil
}

// Determine if the replica create backup is complete and return if not. This allows for proper
// orchestration of backup Jobs since only one backup can be run at a time.
backupCondition := meta.FindStatusCondition(postgresCluster.Status.Conditions,
	ConditionReplicaCreate)
if backupCondition == nil || backupCondition.Status != metav1.ConditionTrue {
	return nil
}

As a result, the operator waits indefinitely for the backup job to appear:

if errors.Is(err, ErrBackupJobNotFound) {
	log.Info("Waiting for backup to start")
	return reconcile.Result{RequeueAfter: time.Second * 5}, nil
}
return reconcile.Result{}, errors.Wrap(err, "find backup job")

Solution:

  • Add a new .status.conditions field to the PerconaPGCluster resource.
  • If the required conditions on the PostgresCluster resource (PGBackRestRepoHostReady and PGBackRestReplicaCreate) are not True, a ReadyForBackup condition with status False is added to the PerconaPGCluster.
  • If ReadyForBackup is False, the operator skips the scheduled backup creation and logs a message instead.
  • When a new PerconaPGBackup resource is created and the operator is waiting for its backup job to appear, it checks the ReadyForBackup condition. If the condition has been False for more than 2 minutes, the backup is marked as Failed (see the sketch after this list).
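
The PR's exact implementation lives in the operator's controllers; the following is only a minimal sketch of both checks, assuming hypothetical helper names (updateReadyForBackup, shouldFailWaitingBackup) and the apimachinery condition helpers. It shows how ReadyForBackup could be derived from the two PostgresCluster conditions and how the 2-minute rule could be applied while waiting for the backup job.

// Sketch only: helper names, reasons and messages below are illustrative,
// not the operator's actual code.
package pgcluster

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const ConditionReadyForBackup = "ReadyForBackup"

// updateReadyForBackup derives the ReadyForBackup condition for the
// PerconaPGCluster status from the underlying PostgresCluster conditions.
func updateReadyForBackup(crunchyConditions []metav1.Condition, perconaConditions *[]metav1.Condition) {
	status := metav1.ConditionTrue
	reason := "BackupConditionsMet"

	for _, t := range []string{"PGBackRestRepoHostReady", "PGBackRestReplicaCreate"} {
		c := meta.FindStatusCondition(crunchyConditions, t)
		if c == nil || c.Status != metav1.ConditionTrue {
			status = metav1.ConditionFalse
			reason = "BackupConditionsNotMet"
			break
		}
	}

	// SetStatusCondition updates LastTransitionTime only when the status changes,
	// which is what the 2-minute check below relies on.
	meta.SetStatusCondition(perconaConditions, metav1.Condition{
		Type:    ConditionReadyForBackup,
		Status:  status,
		Reason:  reason,
		Message: "derived from PGBackRestRepoHostReady and PGBackRestReplicaCreate",
	})
}

// shouldFailWaitingBackup sketches the 2-minute rule: while the backup
// controller waits for the backup job to appear, it gives up once
// ReadyForBackup has been False for longer than the grace period.
func shouldFailWaitingBackup(perconaConditions []metav1.Condition, now time.Time) bool {
	c := meta.FindStatusCondition(perconaConditions, ConditionReadyForBackup)
	if c == nil || c.Status != metav1.ConditionFalse {
		return false
	}
	return now.Sub(c.LastTransitionTime.Time) > 2*time.Minute
}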

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

@pooknull pooknull marked this pull request as ready for review April 16, 2025 11:18
Comment on lines 27 to 36
func (f *fakeClient) Patch(ctx context.Context, obj client.Object, patch client.Patch, options ...client.PatchOption) error {
	err := f.Client.Patch(ctx, obj, patch, options...)
	if !k8serrors.IsNotFound(err) {
		// Patch succeeded, or failed for a reason other than the object missing.
		return err
	}
	// The object does not exist yet: create it, then retry the patch.
	if err := f.Create(ctx, obj); err != nil {
		return err
	}
	return f.Client.Patch(ctx, obj, patch, options...)
}
Contributor
Do we need this? By removing it nothing fails on the controller tests.


@@ -505,7 +470,7 @@ func updatePGBackrestInfo(ctx context.Context, c client.Client, pod *corev1.Pod,
 }

 func finishBackup(ctx context.Context, c client.Client, pgBackup *v2.PerconaPGBackup, job *batchv1.Job) (*reconcile.Result, error) {
-	if checkBackupJob(job) == v2.BackupSucceeded {
+	if job != nil && checkBackupJob(job) == v2.BackupSucceeded {
@gkech (Contributor) commented Apr 17, 2025
Should we maybe validate the input job once at the top of the function and avoid repeating the same check across different places?

e.g.

func finishBackup(ctx context.Context, c client.Client, pgBackup *v2.PerconaPGBackup, job *batchv1.Job) (*reconcile.Result, error) {
	if job == nil {
		// do something
	}

pooknull (Contributor, PR author) replied:

It will be clearer to repeat this check. We can't change the order of the actions in this function, so adding an if job == nil check at the top would just lead to a lot of duplicated code.
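
For illustration only (the real finishBackup body is not shown here): when job-dependent checks are interleaved with steps that must always run, guarding each check in place keeps a single sequence, whereas an early nil branch would have to duplicate the always-run steps.

// Hypothetical example, not the operator's finishBackup: the prints stand in
// for steps that must always run, whether or not a job exists.
package main

import "fmt"

func finishExample(jobSucceeded *bool) {
	fmt.Println("step A: update status fields") // always runs

	if jobSucceeded != nil && *jobSucceeded { // nil guard repeated in place
		fmt.Println("record success")
	}

	fmt.Println("step B: clean up resources") // always runs

	if jobSucceeded != nil && !*jobSucceeded { // nil guard repeated in place
		fmt.Println("record failure")
	}
	// Hoisting an `if jobSucceeded == nil { ... }` branch to the top would force
	// steps A and B to be duplicated inside it (or an early return would skip them).
}

func main() {
	finishExample(nil) // steps A and B still run without a job
}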

@hors hors merged commit 56cbc36 into main Apr 18, 2025
15 of 16 checks passed
@hors hors deleted the K8SPG-680 branch April 18, 2025 09:54